Deep Learning Math · Chapter 07 · Capstone

Build an LLM from scratch

You've met six ideas one at a time. Now watch them snap together into a single machine that reads, understands, and writes. No new magic — just the pieces you already own, assembled into a mind.

ASSEMBLE

The one job

A language model does exactly one thing: guess the next word.

That's the whole trick. Give it “The cat sat on the…” and it returns a probability for every possible next word. Pick one, glue it on, and ask again. Do that a few hundred times and you have an essay. Everything else is plumbing to make that one guess good.

Here is the entire pipeline a word travels through. Each stage is powered by a chapter you've already finished; the label with each stage tells you which.

Tokenize (new here)
Embed (Ch.4 vectors)
+ Position (Ch.1 waves)
Attention (Ch.4 + Ch.6)
Transform (Ch.5 matrices)
Predict (Ch.6 softmax)
↻ repeat: append the new word, run again

STEP 1 a missing piece

Tokenize — chop text into bite-size pieces.

A model can't read letters; it reads numbers. So first we slice text into tokens — whole words, or chunks of them — and hand each a fixed ID from a dictionary of maybe 50,000 entries. “Tokenization” itself becomes token·ization.

The·464 cat·2543 sat·7891 on·319 the·464
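
The lookup above can be sketched in a few lines. This is a toy under stated assumptions: the four-entry vocabulary, the lowercasing step, and the UNKNOWN_ID fallback are all hypothetical, and real tokenizers learn sub-word pieces from data rather than splitting on spaces.

```python
# Toy vocabulary: the IDs mirror the example above.
# A real vocabulary has tens of thousands of entries.
VOCAB = {"the": 464, "cat": 2543, "sat": 7891, "on": 319}
UNKNOWN_ID = 0  # hypothetical ID for anything outside the toy vocabulary


def tokenize(text):
    """Split on whitespace, lowercase each piece, and map it to its ID."""
    return [VOCAB.get(piece.lower(), UNKNOWN_ID) for piece in text.split()]


token_ids = tokenize("The cat sat on the")
print(token_ids)  # [464, 2543, 7891, 319, 464]
```

Because this sketch lowercases first, "The" and "the" share one ID. A case-sensitive tokenizer would give them different IDs.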

STEP 2 ← powered by Chapter 4 · Vectors

Embed — turn each token into an arrow of meaning.

An ID is just a name tag — it carries no meaning. So each token is looked up in a giant table and replaced by a vector: a list of hundreds of numbers, an arrow in a vast space of meaning. The model learns this table so that related words land near each other, and directions become concepts. That's Chapter 4, exactly: king − man + woman lands near queen.
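
To make the arrow arithmetic concrete, here is a minimal sketch with hand-picked two-dimensional vectors. The numbers are assumptions chosen so the example works out (axis 0 stands for "royalty", axis 1 for "maleness"); a real embedding table is learned and has hundreds of dimensions.

```python
import math

# Hand-picked toy embeddings, not learned values.
EMBEDDINGS = {
    "king":  [0.9, 0.9],
    "queen": [0.9, 0.1],
    "man":   [0.1, 0.9],
    "woman": [0.1, 0.1],
}


def nearest(vector):
    """Return the word whose embedding is closest to `vector`."""
    return min(EMBEDDINGS, key=lambda word: math.dist(vector, EMBEDDINGS[word]))


king, man, woman = (EMBEDDINGS[w] for w in ("king", "man", "woman"))

# king - man + woman, computed one coordinate at a time
result = [k - m + w for k, m, w in zip(king, man, woman)]
print(nearest(result))  # queen
```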

STEP 3 ← powered by Chapter 1 · Trigonometry

Add position — stamp each word with a wave.

There's a catch: the model looks at all words at once and has no built-in sense of order. “Dog bites man” and “man bites dog” would look identical. The fix is beautiful — add a blend of sine and cosine waves of different frequencies to each position, a unique fingerprint of “where am I in the sentence.” The waves you spun in Chapter 1 are how an LLM knows word order.
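
One common way to build that fingerprint is sketched below. It assumes the sine/cosine scheme with a frequency base of 10000, as in the original Transformer design; the function names are hypothetical, and many models use other position schemes.

```python
import math


def positional_encoding(position, dim, base=10000.0):
    """Return a length-`dim` wave fingerprint for one position (dim must be even)."""
    encoding = []
    for i in range(0, dim, 2):
        # Each pair of slots uses a slower wave than the pair before it.
        angle = position / (base ** (i / dim))
        encoding.append(math.sin(angle))
        encoding.append(math.cos(angle))
    return encoding


def add_position(embeddings):
    """Add each position's fingerprint to the embedding at that position."""
    dim = len(embeddings[0])
    return [
        [e + p for e, p in zip(vector, positional_encoding(pos, dim))]
        for pos, vector in enumerate(embeddings)
    ]


# The same word at two positions no longer looks identical.
same_word_twice = [[0.5, 0.5, 0.5, 0.5], [0.5, 0.5, 0.5, 0.5]]
stamped = add_position(same_word_twice)
print(stamped[0] != stamped[1])  # True
```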

STEP 4 ← powered by Ch.4 dot product + Ch.6 softmax

Attention — let every word look at every other.

This is the heart of the transformer, and it's just two ideas you already have, holding hands. To understand a word, the model asks: which other words should I pay attention to? It scores every pair with a dot product (how aligned are their arrows?), shrinks the scores a little (dividing by √dₖ, where dₖ is the number of dimensions in the vectors being compared) so they don’t blow up, then runs them through softmax so they become percentages of attention. Each word becomes a blend of all the words’ value vectors, weighted by those percentages.

Real models run this with hundreds of dimensions and many “heads” at once, but it is exactly this: score by scaled dot product, normalize by softmax, blend the value vectors.
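
Those three moves fit in a short sketch. As a simplifying assumption, the same toy vectors serve as queries, keys, and values; real models first pass each word through learned matrices to produce the three, and run many heads in parallel.

```python
import math


def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    peak = max(scores)  # subtracting the max keeps exp() from overflowing
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def attention(queries, keys, values):
    """Single-head scaled dot-product attention over lists of vectors."""
    d_k = len(keys[0])
    outputs, all_weights = [], []
    for q in queries:
        # 1. score every pair, 2. shrink by sqrt(d_k), 3. normalize
        scores = [dot(q, k) / math.sqrt(d_k) for k in keys]
        weights = softmax(scores)
        # 4. blend the value vectors using those weights
        blended = [
            sum(w * v[j] for w, v in zip(weights, values))
            for j in range(len(values[0]))
        ]
        outputs.append(blended)
        all_weights.append(weights)
    return outputs, all_weights


# Three toy word vectors: the first two point in nearly the same direction.
vectors = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
outputs, weights = attention(vectors, vectors, vectors)
print([round(w, 2) for w in weights[0]])  # attention paid by the first word
```

The first word gives the most weight to itself and its near neighbor, and the least to the word pointing in an unrelated direction.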

STEP 5 ← powered by Chapter 5 · Matrices

Transform — then stack it, dozens deep.

After attention mixes information between words, each word's vector is pushed through a small feed-forward network: a couple of matrix multiplications, with a nonlinearity between them, that reshape it into a richer representation. Attention + transform together make one layer. Then we do it again. And again. GPT-style models stack dozens of identically structured layers, each with its own weights and each refining the meaning a little more, exactly the “chain of transformations” from Chapter 5.
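
A rough sketch of the transform step follows. The weights W_IN and W_OUT are hypothetical hand-picked numbers, ReLU is one common choice of nonlinearity, and for brevity the loop reuses a single set of weights and leaves out attention, residual connections, and normalization, all of which real layers include.

```python
def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def relu(vector):
    """Nonlinearity: negative entries become zero."""
    return [max(0.0, x) for x in vector]


def feed_forward(vector, w_in, w_out):
    """Expand to a wider hidden vector, apply ReLU, project back down."""
    hidden = relu(matvec(w_in, vector))
    return matvec(w_out, hidden)


# Hypothetical weights: 2 -> 4 -> 2 dimensions. Real ones are learned.
W_IN = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, 1.0]]
W_OUT = [[0.5, 0.0, 0.25, 0.0], [0.0, 0.5, 0.25, 0.5]]


def run_layers(vector, num_layers):
    """Apply the transform repeatedly, feeding each output into the next layer."""
    for _ in range(num_layers):
        vector = feed_forward(vector, W_IN, W_OUT)
    return vector


print(feed_forward([1.0, 2.0], W_IN, W_OUT))  # [1.25, 2.25]
```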

STEP 6 ← powered by Chapter 6 · Probability

Predict — turn the final vector into a guess.

After the last layer, the final word's vector is multiplied out into one score for every token in the vocabulary. Softmax squashes those scores into a probability distribution — and temperature decides how boldly to choose. Sample one word, append it to the sentence, and run the whole pipeline again. That loop — predict, append, repeat — is called autoregression, and it's literally the toy you played with at the end of Chapter 6.
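
The predict, append, repeat loop can be sketched as follows. The function toy_score_fn is a hypothetical stand-in for the whole model (it just favors the token after the last one in a four-token vocabulary), and the fixed random seed is only there to make runs repeatable.

```python
import math
import random


def softmax_with_temperature(scores, temperature=1.0):
    """Low temperature sharpens the distribution; high temperature flattens it."""
    scaled = [s / temperature for s in scores]
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def generate(score_fn, tokens, steps, temperature=1.0, rng=None):
    """Autoregression: predict a token, append it, and run again."""
    rng = rng or random.Random(0)
    tokens = list(tokens)
    for _ in range(steps):
        scores = score_fn(tokens)  # one score per vocabulary entry
        probs = softmax_with_temperature(scores, temperature)
        next_token = rng.choices(range(len(probs)), weights=probs, k=1)[0]
        tokens.append(next_token)
    return tokens


def toy_score_fn(tokens):
    """Hypothetical model: prefers the ID that follows the last token."""
    vocab_size = 4
    favourite = (tokens[-1] + 1) % vocab_size
    return [3.0 if i == favourite else 0.0 for i in range(vocab_size)]


print(generate(toy_score_fn, [0], steps=8, temperature=0.5))
```

Raising the temperature in the last line makes the output wander more often from the favored sequence; lowering it makes the choice nearly deterministic.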

STEP 7 ← powered by Ch.2 calculus + Ch.6 probability

But where do all those numbers come from? Training.

A fresh model is random: the embedding table, the attention weights, every matrix is noise. We fix that by showing it oceans of real text with the next word hidden, and asking it to guess. Then we measure how surprised it was by the true answer, which is the cross-entropy loss from Chapter 6. The lower the probability it gave the true word, the bigger the loss.
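
For a single guess, that surprise is the negative log of the probability the model gave to the true word. A minimal sketch, with made-up probabilities over a three-word vocabulary:

```python
import math


def cross_entropy(probs, true_index):
    """Surprise at the true answer: -log of the probability assigned to it."""
    return -math.log(probs[true_index])


# The true next word is index 1 in both cases.
confident_right = cross_entropy([0.05, 0.90, 0.05], true_index=1)
confident_wrong = cross_entropy([0.90, 0.05, 0.05], true_index=1)

print(round(confident_right, 3))  # 0.105, small loss
print(round(confident_wrong, 3))  # 2.996, large loss
```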

Then Chapter 2 takes over. We compute the derivative of that loss with respect to every single weight (which way is downhill?) and nudge them all a hair in that direction. That's gradient descent, run with the chain rule (“backpropagation”), billions of times. Slowly, the noise becomes a model that knows the world.
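
Here is a deliberately tiny sketch of that nudge. As a simplifying assumption, the only "weights" are the three output scores themselves, so no backpropagation through layers is needed; for softmax followed by cross-entropy, the derivative with respect to each score is the predicted probability minus 1 for the true word (minus 0 otherwise). The learning rate of 0.5 is an arbitrary choice.

```python
import math


def softmax(scores):
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def loss(logits, true_index):
    """Cross-entropy of the softmax of the scores against the true word."""
    return -math.log(softmax(logits)[true_index])


def train_step(logits, true_index, learning_rate=0.5):
    """One gradient-descent step: move each score a little way downhill."""
    probs = softmax(logits)
    grads = [p - (1.0 if i == true_index else 0.0) for i, p in enumerate(probs)]
    return [z - learning_rate * g for z, g in zip(logits, grads)]


logits = [0.0, 0.0, 0.0]  # a "fresh" model: no preference at all
losses = []
for _ in range(20):
    losses.append(loss(logits, true_index=2))
    logits = train_step(logits, true_index=2)

print(round(losses[0], 3), round(losses[-1], 3))  # the loss shrinks step by step
```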

The whole machine, on one page

Six ideas. One mind.

Ch.1 · Trigonometry

Sine waves tell the model the order of the words.

Ch.2 · Calculus

Gradients tell every weight which way to improve.

Ch.3 · Diff. Equations

Training is a flow downhill; sibling models (diffusion) solve a differential equation directly.

Ch.4 · Vectors

Words become arrows; dot products drive attention.

Ch.5 · Matrices

Every layer uses matrices to transform the meaning.

Ch.6 · Probability

Softmax makes the guess; cross-entropy grades it.

There was never any magic.

A language model is tokens turned into arrows, stamped with waves, mixed by attention, reshaped by matrices, collapsed into a probability, and tuned by gradients. Six ideas, each of which you can now picture for yourself.

What makes a language model work was never out of reach. It was only ever a story told in scattered pieces — and now you have all of it, end to end.
